-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-8446: MCO-503: daemon: have a special path to sync in certs #3575
OCPBUGS-8446: MCO-503: daemon: have a special path to sync in certs #3575
Conversation
@yuqi-zhang: This pull request references MCO-503 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Skipping CI for Draft Pull Request. |
/test all |
Will update the commit messages, but the functionality should be ready for review. Seems to (mostly) be working locally, let's see how it fares against CI |
pkg/daemon/certificate_writer.go
Outdated
|
||
kubeAPIServerServingCABytes := controllerConfig.Spec.KubeAPIServerServingCAData | ||
|
||
if err := writeFileAtomicallyWithDefaults(caBundleFilePath, kubeAPIServerServingCABytes); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we're rewriting the file even if it hasn't changed, but something else in the controllerconfig did? That may be OK for now. But I am sure if this approach was to be extended at all beyond this we'd want to enhance things.
I've been thinking about things like this in the context of containers/bootc#22 and if we represented this certificate as a bootc configmap, the design I have in mind for the bootc configmap support would use ostree as a backend, and inherently hash its inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can certainly make it more nuanced. This is more of a hacky way to get us past the issues highlighted in https://issues.redhat.com/browse/MCO-499, with a better way to do it in 4.14
@yuqi-zhang: This pull request references MCO-503 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
912c4d4
to
f77a37f
Compare
added a comment and annotation for easier troubleshooting if something were to go wrong |
gcp-op failed since the actual on-disk state was not yet synced, so added a wait there. I will clean up the commit messages and squash once we decide if this is acceptable as a workaround for 4.13 |
dn.ccQueue.AddRateLimited(key) | ||
} | ||
|
||
func (dn *Daemon) syncControllerConfigHandler(key string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Will this run outside of the usual dn.update()
process? Also, could it potentially write different content to caBundleFilePath
than the MachineConfig contains?
I ask these questions because of Config Drift Monitor. If the file is part of the MachineConfig, Config Drift Monitor will expect its contents to match. It sounds like at a minimum, Config Drift Monitor should ignore caBundleFilePath
since we're managing it here.
EDIT: Since you added a guard clause to checkV3Files()
to ignore it, that will handle the Config Drift Monitor case since Config Drift Monitor basically runs checkV3Files()
in response to any filesystem events that target files specified in the MachineConfig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, i explicitly added 2 clauses in validation and writes so this would happen completely separately. This does of course cause 2 issues implicitly:
- the config drift monitor keeps triggering whenever the file changes, see nothing changed, might be confused, but then lets it carry on
- the node still says it updates (rebootless), performs the update, but doesn't touch the file (if its the only change), so it technically does a fake update
I can try to make these processes better but they don't block anything I believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine as-is. You are correct about it not blocking anything: Config Drift Monitor runs in a separate Goroutine from the rest of the MCD and from that Goroutine, it calls validateOnDiskState()
whenever it detects a filesystem event for the files it keeps track of.
Additionally, running validateOnDiskState()
is inexpensive enough that even if it runs superfluously in response to writing to caBundleFilePath
, it won't have any noticeable impact.
Have a new routine that watches for controllerconfig changes to read the latest kubelet ca bundle, and lay that down to disk directly. File writing and validation will skip over it. When successful, the daemon will update node annotations to indicate the latest resourceVersion of the controllerconfig it read from, as well as a log. Note that due to caching, the log will appear a few times, but the additional writes should be no-ops. In the future, we should have a more comprehensive method to manage cert rotation. For now, this will allows us to bypass monthly certificate out-of-date issues.
Have the MCS always serve the latest kubelet config ca bundle to match the MCD. This will ensure any nodes joining the cluster always has the latest certificates, even if the pool is otherwise not updated or paused.
Soft revert of openshift#2802. This should no longer be needed since the MCD will always sync the cert bundle to disk. If things go wrong, the MCD should degrade.
Add a e2e test for rotating certs for a paused pool
6489989
to
e68a59f
Compare
Reorganized the commits to be more logical in separation and ordering. Also modified logic to only write if we have not acted on the particular resourceVersion |
/retest-required |
@yuqi-zhang: This pull request references Jira Issue OCPBUGS-8446, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold Revision e68a59f was retested 3 times: holding |
/jira refresh The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity. |
@openshift-bot: This pull request references Jira Issue OCPBUGS-8446, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh The requirements for Bugzilla bugs have changed (BZs linked to PRs on main branch need to target different OCP), recalculating validity. |
@openshift-bot: No Bugzilla bug is referenced in the title of this pull request. Retaining the bugzilla/valid-bug label as it was manually added. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold cancel |
/test e2e-aws-ovn-upgrade |
/refresh |
/retest-required |
skipping optional tests |
Looks to me that e2e-aws-ovn-upgrade is broken https://issues.redhat.com/browse/TRT-897. Running other upgrade job to get better signal |
@yuqi-zhang: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
e2e-gcp-upgrade test has passed. Overriding e2e-aws-ovn-upgrade test which is red for a while (which we don't know the root cause yet, possibly https://issues.redhat.com/browse/TRT-897 ?) to get this merged. |
@sinnykumari: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/override ci/prow/e2e-aws-ovn-upgrade |
@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/e2e-aws-ovn-upgrade In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Thanks! /cherry-pick release-4.13 |
@yuqi-zhang: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@yuqi-zhang: Jira Issue OCPBUGS-8446: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-8446 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@yuqi-zhang: new pull request created: #3612 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Have a new routine that watches for controllerconfig changes to read the latest kubelet ca bundle, and lay that down to disk directly. File writing and validation will skip over it.
This is still WIP but the functionality should work. You can try this by pausing the pools and forcing a rotation of any of the certificates in the bundle. Although it won't be instant, the MCD will sync it once the safety controllerconfig sync happens.
I tried doing this via an additional configmap but it feels like this is a bit cleaner, and allows the MCD to have limited permissions in what it can read.
Note that this is only a temporary 4.13 solution, with future steps outlined in https://issues.redhat.com/browse/MCO-499 for a more comprehensive solution